-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add StatefulSet Volume Expansion Kep #660
Add StatefulSet Volume Expansion Kep #660
Conversation
/assign @janetkuo |
|
||
### Risks and Mitigations | ||
|
||
Under the `ExpandPersistentVolumes` feature, pods referencing a volume must be restarted for file system expansion to occur after the `FileSystemResizePending` condition is True on the persistent volume claim. If the StatefulSet is configured with the `RollingUpdate` update strategy, then the StatefulSet controller would need to wait for the `FileSystemResizePending` condition to be satisfied on each persistent volume claim referenced by the pod it is updating before restarting the pod. This could noticeably increase the update time of a StatefulSet with many replicas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can check if the ExpandInUsePersistentVolumes
feature gate enabled, and restart the pods immediately without waiting controller side resizing finished if it is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed we can. That would just translate into not running the following check if that feature flag in enabled in the current implementation: https://github.com/kubernetes/kubernetes/pull/71384/files#diff-988743e0eb9313bcfa8fd43ac3a0977fR229
@kubernetes/sig-apps-feature-requests @kubernetes/sig-storage-api-reviews |
|
||
While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased. | ||
|
||
The pod will be restarted after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "The StatefulSet controller will delete and recreate the pod after..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if ExpandInUsePersistentVolumes
feature is enabled, we don't need to kill the pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking into this right now and to not kill the pod we would have to recognize that only the the volumeClaimTemplate storage requests have been modified and ExpandInUsePersistentVolumes is enabled. This would be achieved by comparing the current and updated version of the statefulsets which are formed by applying the patches in the updateRevision and currentRevision to a representation of the statefulset (in this case it's just named set).
The issue is that the representation of the statefulset these patches are applied to seems to be the updated version of the StatefulSet. This is problematic whilst forming the current representation of the statefulset. For example, if the updated representation of the set has two labels {"label1": "value1", "label2": "value2"}
while the current version only has 1 label {"label1": "value1"}. Applying the currentRevision patch yields a set with labels {"label1": "value1", "label2": "value2"}
which is not the current version of the StatefulSet. This issue is problematic as it makes comparisons between the currentSet and updateSet unreliable to determine if termination of pods is required(as we are trying to achieve above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can start with destroying the Pod and recreating it. It's a trade off in that workloads that need to rebuild a large in memory working set from persistent data would greatly benefit from in place resizing. However, all other operations that operate on a StatefulSet Pod currently require that the Pod be destroyed and recreated. If we want to support in place mutation later we can do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, that's also how its currently implemented in the PR so I will leave that as is. I'll make changes to this proposal to reflect that.
While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased. | ||
|
||
The pod will be restarted after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VolumeClaimTemplate
should be recorded in the StatefulSet's ControllerRevision
object as well for rollback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being done in implementation forgot to record it here 🤦♂️
|
||
The apiserver will allow for increases to storage requests in the `volumeClaimTemplates` component of a StatefulSet. | ||
|
||
During the StatefulSet update process, the StatefulSet controller will detect an update to a `volumeClaimTemplate` by comparing the updated and current revision of the StatefulSet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change should be detected by looking at the StatefulSet's ControllerRevision
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updateSet is formed by applying the updateRevision (ControllerRevision) to the current set. That should be adequate right?
Under the `ExpandPersistentVolumes` feature, pods referencing a volume must be restarted for file system expansion to occur after the `FileSystemResizePending` condition is True on the persistent volume claim. If the StatefulSet is configured with the `RollingUpdate` update strategy, then the StatefulSet controller would need to wait for the `FileSystemResizePending` condition to be satisfied on each persistent volume claim referenced by the pod it is updating before restarting the pod. This could noticeably increase the update time of a StatefulSet with many replicas. | ||
|
||
A potential mitigation would be the eventual adoption of the `ExpandInUsePersistentVolumes` alpha feature in Kubernetes v1.11 which enables file system expansion on volumes being used by a pod. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's optional, but you can include the name of the feature gate of this new feature in this proposal, e.g. StatefulSetVolumeExpansion
(just a candidate, feel free to propose other names).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StatefulSetVolumeExpansion
sound good to me :)
8920739
to
d9ea0b2
Compare
/assign @gnufied |
|
||
While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased. | ||
|
||
The Statefulset controller will delete and recreate the pod after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references. However, if the `ExpandInUsePersistentVolumes` feature is enabled, then deleting the pod is unnecessary to complete file system expansion on updated persistent volumes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I like this bit. It kinda circumvents the fact that currently volume resizing is offline
by default and it kinda implements a redundant mechanism that will become unnecessary when online resizing graduates to beta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for alpha feature I propose skipping this mechanism because once we implement this - user will come to rely on it and it will be harder to go back on it. We should also recognize the fact that certain volume types can not be resized when in-use (like AzureDisk for example). Currently this information is surfaced to the user via PVC events if a PVC can't be resized when used in a pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I like this bit. It kinda circumvents the fact that currently volume resizing is
offline
by default and it kinda implements a redundant mechanism that will become unnecessary when online resizing graduates to beta.
Not sure what mechanism you are referring to here. Is it the mechanism of detecting the Alpha feature being enabled and not killing the pod? Or are you suggesting that we don't kill the pod in any case since online resizing will eventually graduate to beta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I am suggesting we skip killing the pods in stateful set controller since online resizing will eventually graduate to beta (most likely in 1.14 release itself).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok makes sense. I was just confused because it is killing the pod by default and a mechanism would need to be implemented to ensure that the pod is not killed at the end of the update loop*.
We should also recognize the fact that certain volume types can not be resized when in-use (like AzureDisk for example). Currently this information is surfaced to the user via PVC events if a PVC can't be resized when used in a pod.
So do we want to make an exception in this case to kill the pod from the controller or just update and let the event surface on the PVCs(and possibly add an additional indicator) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to start with always restarting the Pod after the volume increase request is fulfilled. (As below) the workloads controllers don't to in place updates for any other container or Pod modifications today. We can (maybe should) consider optimizing to detect and implement in place updates at a later point in time.
|
||
The Statefulset controller will delete and recreate the pod after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references. However, if the `ExpandInUsePersistentVolumes` feature is enabled, then deleting the pod is unnecessary to complete file system expansion on updated persistent volumes. | ||
|
||
The functionality provided by this enhancement will be gated by the `StatefulSetVolumeExpansion` feature gate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to think about the patch mechanism used to apply controller revisions and the implications to rollback (both of StatefulSets and Kubernetes API Servers) when this feature is enabled. Also, you to think about modifications to validation that are necessary to allow mutation of the VolumeClaimsTemplate and how that effects backward compatibility during upgrade. It may be necessary to relax the constraint in the API server one release prior to adding the expansion feature into StatefulSet to enable upgrade in multi-API Server clusters and to allow for safe downgrades/rollbacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the implementation PR, the apiserver only allows increases to storageRequests if the feature is enabled (relaxing validation). Mentioned it here in this proposal.
I don't think there should be issues with backwards compatibility for the api, unless a client has (possibly inadvertently) built functionality that relied upon receiving an error response while modifying storageRequests in volumeClaimTemplates before this feature is added.
In regards to patches, for the implementation PR I just added the volumeClaimTemplates to the patch. However, rollbacks seem to be a problem since only volume expansion is supported. So rolling back to a previous revision after expanding a volume would imply the client is attempting to shrink volumes (which is unsupported). Currently the apiserver will err if the client attempts such a rollback and relay to the client that storage requests can only be increased. This can be an issue if the client has attempted a volume expansion, which for some reason failed on all PVCs, and then attempts to rollback the change. In all other cases of rolling back volume expansion changes (partial or full successes) I'm not sure if a rollback should be allowed. On the other hand, this can be problematic if the client attempts to rollback some other change but is prevented from using that functionality as they had also expanded a volumeClaimTemplate in the same or any following revision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For PVC expansion feature, we also check if expansion is explicitly enabled in SC via an admission controller. Will this be possible to do that here for Stateful sets? It looks like - if feature gate is enabled volume expansion will generally be available for all volume types which is problematic (since not all volume types support expansion).
The other tricky bit this design does not address is - how it will handle expansion of volumes that only can be expanded offline. For PVC themselves, it is not that big of deal. But for stateful sets - user must take the Pod offline before volume that the pod was using can be expanded. So in effect - will this mean scaling entire stateful set to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other tricky bit this design does not address is - how it will handle expansion of volumes that only can be expanded offline. For PVC themselves, it is not that big of deal. But for stateful sets - user must take the Pod offline before volume that the pod was using can be expanded. So in effect - will this mean scaling entire stateful set to 0?
As it says in the design The Statefulset controller will delete and recreate the pod after the FileSystemResizePending condition is True on all updated persistent volume claims it references.
. Thus it will wait until that condition is reached before deleting the pod. This is how I have it implemented in the PR. This way the pod will continue to be available until all updated PVCs that must be expanded offline have the FileSystemResizePending
condition (and be deleted and recreated after that).
Let me know if this addresses that concern.
For PVC expansion feature, we also check if expansion is explicitly enabled in SC via an admission controller. Will this be possible to do that here for Stateful sets? It looks like - if feature gate is enabled volume expansion will generally be available for all volume types which is problematic (since not all volume types support expansion).
I did see the PersistentVolumeClaimResize
admission controller when initially implementing this. I wasn't sure if it would be necessary to build a similar(perhaps redundant) admission controller for a statefulset as when the statefulset controller eventually attempts to update the pvc it would catch the error sent by the admission controller and the update to the statefulset would not succeed. On the other hand building a specific admission controller for this would ensure that such an update to a statefulset is never validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if this addresses that concern.
Hmm not entirely. Thing is for some volume types(Azure Disk for example), the volume can not be expanded via control-plane call if volume is being used. So literally no expansion can happen to volume while it is in use. The user must first detach the volume from the node before he can call Azure API call that expands the volume. So in this case - there will not be a FileSystemResizePending
added to PVC because control-plane call itself has not succeeded (and will not succeed until volume is taken offline).
I wasn't sure if it would be necessary to build a similar(perhaps redundant) admission controller for a statefulset as when the statefulset controller eventually attempts to update the pvc it would catch the error sent by the admission controller and the update to the statefulset would not succeed.
I think, this is already too late. api-server has at this point "accepted" the API request that it can not fulfill. Will stateful set controller not retry PVC update if first update has failed? The idea behind rejecting the request as early as possible is - controllers shouldn't have to keep retrying it. It may be worth expanding existing admission controller to reject stateful set updates in similar fashion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thing is for some volume types(Azure Disk for example), the volume can not be expanded via control-plane call if volume is being used. So literally no expansion can happen to volume while it is in use.
Oh ok. I read through this PR and I think I get what you mean now. I see two issues with trying to get it to work with such volumes types.
-
For each pvc template in the statefulset it would now be necessary to query the api for its storage class to find the provisioner. Then it would be possible to determine how to apply the update. This is doable.
-
The statefulset controller is structured so that all updates to associated resources and the pod spec occur before the pod it deleted and recreated (as this makes sense in almost every context when updating a statefulset). However, to support the volume types above we would be required to delete the pod before its associated PVCs can be updated. This changes/breaks the flow of how the statefulset is currently updated even though it should be possible to implement.
So in effect - will this mean scaling entire stateful set to 0?
Well since there is a 1:1 map from PVC to PV monotonic updates should still be possible. (delete pod, update pvc request, wait until condition reached, restart pod).
@kow3ns how do you think this should this be addressed
|
||
A potential mitigation would be the eventual adoption of the `ExpandInUsePersistentVolumes` feature, which enables file system expansion on in-use volumes thus, making it unecessary to delete and recreate any refrencing pods during volume expansion. | ||
|
||
## Graduation Criteria |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add a bit more (as suggested above) about we plan to roll the feature out and how we plan to test it.
|
||
Move to Alpha after initial implementation and approvals. | ||
|
||
Move to Beta once feature has been in Alpha for a set duration with no issues being reported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also like a notional outline of GA graduation criteria. Another concern that should be addressed is that, as the modification effects a GA API, enabling by default at Beta will effectively release it globally. This should at least be called out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a notional outline? 😅 Is it an outline which attempts to identify possible scenarios and the suggested approach to handling them?
I don't see any API changes proposed so I don't think this needs API approvers. I'd like to discuss this at SIG Apps. Generally, I think its something that is necessary to do in order to support the volume resizing feature implemented at that storage layer, and I think we should (pending a few changes) merge it as accepted and work on getting it to implementable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove any references to NEXT_KEP_NUMBER
and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.
@kow3ns since this relies on relaxing an API validation, it was decided to run this by api reviewers as well. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
@travisghansen: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/remove-lifecycle rotten |
So...where is this at? |
See here #2842 |
Add Kep for StatefulSet Volume Expansion.
Wrote this up quickly after the discussion here:
kubernetes/kubernetes#71384
Feature tracking issue: #661
Let me know if something is missing or if this is the right way to merge this 🤔